Abstract

When a data-parallel language like Fortran 90 is com- 
piled for a distributed-memory machine, aggregate data 
objects (such as arrays) are distributed across the processor 
memories. The mapping determines the amount of residual 
communication needed to bring operands of parallel opera- 
tions into alignment with each other. A common approach 
is to break the mapping into two stages: first, an alignment 
that maps all the objects to an abstract template, and then a 
distribution that maps the template to the processors. 

We solve two facets of the problem of finding align- 
ments that reduce residual communication: we determine 
alignments that vary in loops, and objects that should have 
replicated alignments. We show that loop-dependent mo- 
bile alignment is sometimes necessary for optimum perfor- 
mance, and we provide algorithms with which a compiler 
can determine good mobile alignments for objects within do 
loops. We also identify situations in which replicated align- 
ment is either required by the program itself (via spread 
operations) or can be used to improve performance. We 
propose an algorithm based on network flow that deter- 
mines which objects to replicate so as to minimize the total 
amount of broadcast communication in replication. This 
work on mobile and replicated alignment extends our ear- 
lier work on determining static alignment. 


1 Introduction 

Parallelism is expressed in data-parallel array languages 
like Fortran 90 [1] in the form of operations on arrays and 
array sections. Compiling such a program for a distributed- 
memory parallel machine requires a model for the mapping 
of the data to the machine. We view the mapping as an 
alignment to a Cartesian index space called a template,
followed by a distribution of the template to the processors.
The alignment phase positions all array objects in the
program with respect to each other so as to reduce realign- 
ment communication cost. In the distribution phase that 
follows, the template is distributed to the processors. This 
two-phase approach separates the language issues from the 
machine issues, and is used in Fortran D [7], High Perfor- 
mance Fortran [10], and CM-Fortran [16]. 

The goal of compilation is to produce data and work 
mappings that reduce completion time. Much of this goal 
can be achieved by judicious alignment of the arrays. We 
consider only alignment here. 

Completion time has two components: computation and 
communication. Communication can be separated into in- 
trinsic and residual communication. Intrinsic communica- 
tion arises from computational operations such as reduc- 
tions that require data motion as an integral part of the 
operation. Residual communication arises from nonlocal 
data references required in a computation whose operands 
are not mapped to the same processors. As we only con- 
sider alignment in this paper, we take the view that objects 
are mapped identically to processors if and only if they are 
aligned. We use the term realignment to refer to residual 
communication due to misalignment; we seek to determine 
array alignments that minimize realignment cost. Commu- 
nication for transpose, spread, and vector-valued subscript 
operations can in some cases be removed by suitable align- 
ment choices. Our theory makes these forms of communi- 
cation residual rather than intrinsic, and thus encompasses 
such optimizations [5]. 

A suitable alignment for the code fragment of Figure 1(a)
is shown in Figure 1(b). Note that V moves at each iteration 
of the loop; it has a mobile alignment. 

In this paper, we present algorithms to automatically de- 
termine good mobile alignments. We develop a detailed 
and realistic model of realignment cost that accounts for 
control flow in loops, and we formulate the alignment prob- 
lem as a constrained optimization of the realignment cost. 
We present approximate solutions for mobile stride and 
offset alignment for array objects occurring within loops, 
where we allow the offset alignment to be a compiler- 
determined affine function of loop induction variables. We 
also show that replication may be viewed as an extension of
offset alignment, and show that the problem of determining
the optimal replication strategy can be reduced to a network
flow problem.

(a)

real A(100,100), V(200)
do k = 1, 100
  A(k, 1:100) = A(k, 1:100) + V(k:k+99)
enddo

(b)

real A(100,100), V(200)
template T
align A(i,j) with T(i,j)
do k = 1, 100
  realign V(i) with T(k,i-k+1)
  A(k, 1:100) = A(k, 1:100) + V(k:k+99)
enddo

Figure 1: (a) A Fortran 90 program fragment requiring mobile alignment. (b) A mobile alignment for the program fragment.

Several other authors have considered static alignment
[2, 9, 12, 13, 17]. Our earlier research [4, 5, 8]
dealt with static alignment. We extend that work to handle 
mobile alignment here. Knobe, Lukas, and Steele [12] and 
Knobe, Lukas, and Dally [11] address the issue of dynamic 
alignment. Their notion of dynamic alignment is alignment 
depending on quantities whose values are known only at 
runtime, which may include loop induction variables as 
well as other arbitrary runtime values. This paper focuses 
on mobile alignment in the context of loops, where the 
alignment of an object is an affine function of the loop 
induction variables. 

The paper is organized as follows. Section 2 formalizes 
the notion of alignment and defines mobile alignment. It 
also introduces our graph model for the alignment problem. 
Section 3 poses and solves the problem of mobile stride 
alignment. Section 4 poses and solves the problem of 
mobile offset alignment, covering fixed- and variable-sized 
objects and loop nests. Section 5 describes an algorithm for 
determining replicated offset alignments. Finally, Section 6 
presents conclusions, open problems, and future work. 

2 The alignment problem 

An alignment is a mapping that takes each element of 
an array to a cell of a template. The template is a con- 
ceptually infinite Cartesian grid, with as many dimensions 
as necessary; it is a piece of “graph paper” on which all 
the array objects in a program are positioned relative to 
each other. The alignment phase of compilation aligns all 
array objects of the program to the template. The distribu- 
tion phase then assigns template cells to actual processors. 
This paper discusses only the alignment phase. 

If A is a d-dimensional array, and $g_1$ through $g_t$ are
integer-valued functions, we write

$$A(i_1, \ldots, i_d) \boxplus T[g_1(i_1, \ldots, i_d), \ldots, g_t(i_1, \ldots, i_d)]$$

to mean that the specified element of A is aligned to the
specified element of the t-dimensional template T. Multiple
templates may be useful in some cases, but this paper
only considers alignment to a single template. Thus we
omit the template name and just write $A(i) \boxplus [g(i)]$, where
i is a d-vector and g is a function from d-vectors to t-vectors.

We restrict our attention to alignments in which each axis 
of the array maps to a different axis of the template, and 
array elements are evenly spaced along template axes. Such 
an alignment has three components: axis (the mapping of 
array axes to template axes), stride (the spacing of array 
elements along each template axis), and offset (the position 
of the array origin along each template axis). Each $g_k$ is
thus either a constant $f_k$ or a function of a single array index
of the form $s_k i_{a_k} + f_k$. The array is aligned one-to-one into
the template. (In Section 5, we extend this to one-to-many 
alignments in which an array can be replicated across some 
template axes.) 

An array-valued object (object for short) is created by 
every array operation and by every assignment to a section 
of an array. The compiler determines an alignment for each 
object of the program rather than for each program variable.
The alignment of an object in a loop may be a function of 
the loop induction variable; such an alignment is mobile. 

2.1 Examples 

We now give examples of the various kinds of alignment. 

Example 1 (Offset alignment) Consider the statement

A(1:N-1) = A(1:N-1) + B(2:N).

If the alignments are $A(i) \boxplus [i]$ and $B(i) \boxplus [i]$, then a one-unit
nearest-neighbor shift is necessary. However, the statement
can be executed without communication if $A(i) \boxplus [i]$ and
$B(i) \boxplus [i - 1]$.

Example 2 (Stride alignment) Consider the statement

A(1:N) = A(1:N) + B(2:2*N:2).

If $A(i) \boxplus [i]$ and $B(i) \boxplus [i]$, then general communication
is needed to bring A and the section of B together. The
alignments $A(i) \boxplus [2i]$ and $B(i) \boxplus [i]$ avoid communication.


Example 3 (Axis alignment) Consider the statement

B = B + transpose(C),

where B and C are two-dimensional arrays. If $B(i_1, i_2) \boxplus [i_1, i_2]$
and $C(i_1, i_2) \boxplus [i_1, i_2]$, then general communication
is needed to transpose C. However, if $B(i_1, i_2) \boxplus [i_1, i_2]$
and $C(i_1, i_2) \boxplus [i_2, i_1]$, then the operands are aligned, and
no communication is necessary.

Example 4 (Mobile offset alignment) Consider the code
fragment in Figure 1. This can be executed optimally if
$A(i_1, i_2) \boxplus_k [i_1, i_2]$ and $V(i_1) \boxplus_k [k, i_1 - k + 1]$. We
use the symbol $\boxplus_k$ to emphasize the dependence of the
alignment on the loop induction variable k.

Example 5 (Mobile stride alignment) Consider the code
fragment

real A(1000), B(1000), V(20)
do k = 1, 50
  V = V + A(1:20*k:k)
  B(1:20*k:k) = V
enddo

Suppose $A(i) \boxplus_k [i]$ and $B(i) \boxplus_k [i]$. If the stride alignment
of V is static, then any alignment of V is equally good, with
a cost of two general communications per iteration. The
cost drops to one general communication per iteration with
the mobile stride alignment $V(i) \boxplus_k [ki]$.
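
The offset and stride examples above can be checked mechanically. The following Python sketch is only an illustration: the helper names `position` and `is_aligned` and the small extent `N = 8` are our own assumptions, not part of the paper. It maps each array index to a template cell using a stride and an offset, and tests whether corresponding operand elements land on the same cell.

```python
def position(index, stride=1, offset=0):
    """Template cell of a 1-D array element under the alignment [stride*i + offset]."""
    return stride * index + offset

def is_aligned(lhs_indices, rhs_indices, lhs_align, rhs_align):
    """True if every pair of corresponding elements shares a template cell."""
    return all(
        position(i, *lhs_align) == position(j, *rhs_align)
        for i, j in zip(lhs_indices, rhs_indices)
    )

N = 8  # small illustrative extent (assumption)

# Example 1: A(1:N-1) = A(1:N-1) + B(2:N)
a_sec = range(1, N)          # indices 1..N-1
b_sec = range(2, N + 1)      # indices 2..N
ex1_identity = is_aligned(a_sec, b_sec, (1, 0), (1, 0))    # A at [i], B at [i]
ex1_shifted = is_aligned(a_sec, b_sec, (1, 0), (1, -1))    # A at [i], B at [i-1]

# Example 2: A(1:N) = A(1:N) + B(2:2*N:2)
a_all = range(1, N + 1)
b_str = range(2, 2 * N + 1, 2)
ex2_identity = is_aligned(a_all, b_str, (1, 0), (1, 0))    # A at [i], B at [i]
ex2_strided = is_aligned(a_all, b_str, (2, 0), (1, 0))     # A at [2i], B at [i]
```

Each alignment is a (stride, offset) pair, so the same helper covers both the offset case of Example 1 and the stride case of Example 2.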

2.2 Alignment-distribution graphs

Our main tool in this paper is a modified and annotated
data flow graph that we call the alignment-distribution
graph, or ADG for short. In this section we briefly describe
the ADG and formulate the alignment problem as an
optimization problem on the ADG. A companion paper [3]
presents a more formal and complete treatment of the ADG.
The ADG is closely related to the static single-assignment
form of programs developed by Cytron et al. [6]. Figure 2
shows the ADG for the program fragment in Figure 1.

Nodes in the ADG represent computation; edges repre- 
sent flow of data. Alignments are associated with endpoints 
of edges, which we call ports. A node constrains the rel- 
ative alignments of the ports representing its operands and 
its results. An edge carries residual communication cost if 
its ports have different alignments. The goal is to provide 
alignments for the ports that satisfy the node constraints 
and minimize the total edge cost.



Figure 2: The ADG corresponding to the program fragment 
of Figure 1. 
2.2.1 Edges

The ADG has a port for each (static) definition or use of an 
object. An edge joins the definition of an object with its 
use. Multiple definitions or uses are handled with merge, 
fanout, and branch nodes as described below. Thus every 
edge has exactly two ports. The purpose of the alignment 
phase is to label each port with an alignment. All communi- 
cation necessary for realignment is associated with edges; 
if the two ports of an edge have different alignments, then 
the edge incurs a cost that depends on the alignments and 
the total amount of data that flows along the edge during 
program execution. 


2.2.2 Nodes

Every array operation is a node of the ADG, with one port 
for each operand and one port for the result. Figure 2 
contains examples of a “+” node representing elementwise 
addition, a Section node whose input is an array and whose 
output is a section of the array, and a SectionAssign node 
whose inputs are an array and a new object to replace a 
section of the array, and whose output is the modified array. 
(SectionAssign is called Update by Cytron et al. [6].)

When a single use of a value can be reached by multiple 
definitions, the ADG contains a merge node with one port 
for each definition and one port for the use. (This node 
corresponds to the φ-function of Cytron et al. [6].) When
a single definition reaches multiple uses within the same
basic block, the ADG contains a fanout node. When a
single definition can reach multiple alternate uses (e.g., due
to conditional constructs), the ADG contains a branch node.
Figure 2 contains examples of merge, fanout, and branch
nodes. Fanout nodes represent opportunities for so-called
Steiner optimization, as discussed in Section 6. Finally, the
ADG for a program with loops contains transformer nodes 
that delimit iteration spaces as described below. 

Nodes constrain the alignments of their ports. An
elementwise operation like “+” constrains all its ports to have
the same alignment. A merge or fanout node enforces the
same constraint. If A is a two-dimensional array in a
two-dimensional template, a node transpose(A) constrains its
output to have the opposite axis alignment from its input;
thus any communication necessary to transpose the array is
assigned to the input or output edges rather than to the node
itself. Section and SectionAssign nodes enforce constraints
that describe the position of a section relative to the position
of the whole array; for example, the node for the section
A(10:50:2) constrains its output object to have the same
axis as its input, twice the stride of its input, and an offset
equal to 10 times the stride of A plus the offset of A.


2.2.3 Iteration spaces

The ADG represents data flow, not control flow. To model 
communication cost accurately, we must account for the 
fact that data can flow over a particular edge many times 
during the program’s execution, and each time the data 
object may have a different size. Section 6 discusses how 
to model arbitrary control flow. Here we deal with the 
important special case in which the only control flow is in 
the form of do loops. 

An edge inside a nest of k loops is labeled with a
k-dimensional iteration space, whose elements are the
vectors of values taken by the loop induction variables (LIVs).
Both the size of the data object on an edge and the align- 
ment of the data object at a port are functions of the LIVs, 
so they may vary over the iteration space. 

For every edge that carries data into, out of, or around a 
loop, we insert a transformer node to describe the relation- 
ship between the iteration spaces at the two ports. Figure 2 
contains examples. A loop-back transformer node, in a
loop do k = l:h:s, constrains the alignment of its input
as a function of k + s to equal the alignment of its output as
a function of k. Consider a (1, k | 1, k+1) transformer node
as in Figure 2. An offset alignment of 2k + 3 on the input
(“k”) port and of 2k + 1 on the output (“k+1”) port satisfies
the node’s constraints. The (1 | 1, 1) transformer node on
entry to this loop constrains its input position (which does
not depend on k) to equal its output position for k = 1.

2.3 Cost model

Finally, we describe the communication cost of the pro- 
gram in terms of the ADG. A position is an encoding of 
a legal alignment. The distance d(p, q) between two
positions p and q is a nonnegative number giving the cost
per element to change the position of an array from p to q.
The set of all positions is a metric space under the distance
function d [4].

In this paper we will use two metrics: the discrete metric,
in which d(p, q) = 0 if p = q and d(p, q) = 1 otherwise,
and the grid metric, in which p and q are grid points and
d(p, q) is the $L_1$ (or Manhattan) distance between them.
We use the discrete metric to model axis and stride align- 
ment, since any change of axis or stride requires general 
communication. The discrete metric is a simple model 
of general communication that abstracts away from such 
machine-specific details as routing, congestion, and soft- 
ware overhead. We use the grid metric to model offset 
alignment. The grid metric is separable , meaning that the 
distance between two points in a multidimensional grid is 
equal to the sum of the distances between their correspond- 
ing coordinates in one-dimensional grids. This property 
allows us to solve the offset alignment problem
independently for each axis [4].
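
As a small illustration of the two metrics and of separability, the following Python sketch (function names are ours, chosen for this example) computes both distances and confirms that the grid distance in two dimensions equals the sum of the one-dimensional distances along each axis.

```python
def discrete_metric(p, q):
    """0 if the two positions coincide, 1 otherwise (models general communication)."""
    return 0 if p == q else 1

def grid_metric(p, q):
    """L1 (Manhattan) distance between two grid points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Two hypothetical offset positions in a two-dimensional template.
p, q = (3, 7), (5, 2)

# Separability: solve each axis as its own one-dimensional problem.
per_axis = [grid_metric((a,), (b,)) for a, b in zip(p, q)]
total = grid_metric(p, q)
```

Because the total is just the sum of the per-axis terms, minimizing the offset cost along one template axis never interferes with another axis.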

We model the communication cost of the program as
follows. Let E be the set of edges of the ADG, and let
$I_{xy}$ be the iteration space for edge (x, y). For a vector i in
$I_{xy}$, let $w_{xy}(i)$ be the data weight, which is the size of the
data object on edge (x, y) at iteration i. Finally, let $\pi$ be
a feasible mobile alignment for the program; that is, for
each port x let $\pi_x(i)$ be an alignment for x at iteration i
that satisfies all the node constraints. Then the realignment
cost of edge (x, y) at iteration i is $w_{xy}(i) \cdot d(\pi_x(i), \pi_y(i))$,
and the total realignment cost of the program is

$$C(\pi) = \sum_{(x,y) \in E} \sum_{i \in I_{xy}} w_{xy}(i) \cdot d(\pi_x(i), \pi_y(i)). \qquad (1)$$

Our goal is to choose $\pi$ to minimize this cost, subject to the
node constraints.
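
To make equation (1) concrete, here is a hedged Python sketch that evaluates it on a deliberately simplified two-edge model of Figure 1, not on the full ADG of Figure 2. The model is our assumption: one edge carries the 100-element section V(k:k+99) into the “+” node, where it must meet A(k, 1:100) at T(k, 1:100), and one loop-back edge carries all 200 elements of V from iteration k to iteration k+1. The function names and the particular static alignment are hypothetical.

```python
def grid(p, q):
    """L1 (Manhattan) distance between two template cells."""
    return sum(abs(a - b) for a, b in zip(p, q))

def figure1_cost(v_pos, n_iter=100):
    """Equation (1) on a simplified two-edge model of Figure 1.

    v_pos(k, i) is the template cell of V(i) during iteration k;
    A is fixed at A(i, j) -> T(i, j).
    """
    cost = 0
    for k in range(1, n_iter + 1):
        # Edge feeding "+": V(k+j-1) must meet A(k, j), which sits at T(k, j).
        cost += sum(grid(v_pos(k, k + j - 1), (k, j)) for j in range(1, 101))
        # Loop-back edge: every element of V moves to its position in the
        # next iteration (nothing moves after the last iteration).
        if k < n_iter:
            cost += sum(grid(v_pos(k, i), v_pos(k + 1, i)) for i in range(1, 201))
    return cost

mobile = figure1_cost(lambda k, i: (k, i - k + 1))  # V(i) at T(k, i-k+1)
static = figure1_cost(lambda k, i: (50, i - 49))    # one fixed placement of V
```

Under these assumptions the mobile alignment pays only for the small shift on the loop-back edge, while the static placement pays on every use of the section, which is the effect that motivates mobile alignment.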

2.4 Restrictions on mobile alignment functions 

So far we have not constrained the form that mobile
alignments may take. In principle, we could allow them to
be arbitrary functions of the LIVs. For reasons of tractability,
we consider only the (important) case in which mobile
alignments of objects are affine functions of the LIVs.
Thus, the mobile offset or stride alignment function for an
object within a k-deep loop nest with LIVs $i_1, \ldots, i_k$ is
of the form $\alpha_0 + \alpha_1 i_1 + \cdots + \alpha_k i_k$, where the coefficient
vector $\alpha = (\alpha_0, \ldots, \alpha_k)$ is what we must determine. We
write this alignment succinctly in vector notation as $\alpha i^T$,
where $i = (1, i_1, \ldots, i_k)$. Both $\alpha$ and i are (k + 1)-vectors.
This reduces to the constant term $\alpha_0$ for an object outside
any loops.

Likewise, we restrict the extents of objects to be affine 
in the LIVs, so that the size of an object is polynomial in 
the LIVs. 


3 Mobile stride alignment 

We use the discrete metric to model communication
costs arising from stride changes. Let the strides at the
ports of an edge be $\alpha i^T$ and $\alpha' i^T$. If $\alpha = \alpha'$, then the ports
will be aligned at every iteration; if the constant terms $\alpha_0$
and $\alpha'_0$ differ but all other components are equal, then they
are always misaligned; otherwise, they are almost always
misaligned. We approximate this situation by considering
the objects to be misaligned in all iterations unless $\alpha = \alpha'$.

As the distance function in equation (1) is independent
of the LIV, we can move it outside the summation over
the iteration space, and write the communication cost of
edge (x, y) as the product of a weight and a distance. The
distance is the discrete metric on (k + 1)-vectors; the weight
is the sum over all iterations of the size of the object at
each iteration, $W = \sum_{i \in I_{xy}} w_{xy}(i)$. Since the weight is
polynomial in the LIVs, the sum can be evaluated in closed
form. We can now use compact dynamic programming, a
technique we have previously developed for static axis and
stride alignment [5], to solve this problem.

4 Mobile offset alignment 

Consider an object with offset alignment $\alpha i^T$. Since the
problem is separable, we can determine offsets with respect
to one template axis at a time. If there are no loops in the
code, the solution reduces to our earlier solution for static
offset alignment [5].

The contribution of edge (x, y) to the residual
communication is

$$C_{xy} = \sum_{i \in I_{xy}} w_{xy}(i) \, |(\alpha - \alpha') i^T|, \qquad (2)$$

where $\pi_x(i) = \alpha i^T$, $\pi_y(i) = \alpha' i^T$, and $I_{xy}$ is the iteration
space associated with the edge. Even if $w_{xy}(i)$ is constant,
the absolute value in equation (2) makes its closed form
complicated. Rather than seek an algorithm to minimize
this cost function, we choose instead to approximate it
by one for which the solution is straightforward. After
reviewing the solution for static offset alignment, we show
the solution for fixed-size objects in singly-nested loops
(k = 1), and then generalize to variable-size objects and to
loop nests.

4.1 Offset alignment by linear programming 

We review how the static offset alignment problem for 
the grid metric can be reduced to linear programming [5]. 
Let the integer $\pi_x$ be the offset alignment of port x. Then
the residual communication cost (which is the function we
want to minimize) is $C(\pi) = \sum_{(x,y) \in E} C_{xy}(\pi)$, so

$$C(\pi) = \sum_{(x,y) \in E} w_{xy} \, |\pi_x - \pi_y|.$$

Nodes introduce linear constraints relating the offsets of
their ports. See [3] for more details. To remove the
absolute value from the objective function, we introduce a
variable $\theta_{xy}$ for every edge (x, y) of the ADG, and add two
inequality constraints,

$$\theta_{xy} + \pi_x - \pi_y \ge 0$$

$$\theta_{xy} - \pi_x + \pi_y \ge 0,$$

that guarantee that $\theta_{xy} \ge |\pi_x - \pi_y|$. The new objective
function is then

$$\sum_{(x,y) \in E} w_{xy} \theta_{xy}.$$

The transformed problem is equivalent to the original one,
because $\theta_{xy} = |\pi_x - \pi_y|$ at optimality. This transformation
introduces |E| new variables and 2|E| new constraints.

If the offsets that result from the linear program are frac- 
tional, we round them to integers. The rounded solutions 
are not necessarily optimal integer solutions; in general, 
rounding an LP solution may not even preserve feasibil- 
ity. However, in the case of offset alignment with the grid 
metric, we argue that rounding is a reasonable approach. It 
is straightforward to round the offsets so as to satisfy all 
the node constraints. The template can be thought of as 
a discrete approximation to a continuous $L_1$ metric space
in which the edge costs are continuous functions of real- 
valued offsets. The unrounded LP optimizes this problem 
exactly, so we expect that the discrete optimum is not very 
sensitive to rounding. We will refer to this algorithm as 
rounded linear programming, or RLP. (We have also exper- 
imented with using mixed integer linear programming.) 
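
The equivalence between the absolute-value objective and its linearized form can be checked on a toy instance. The Python sketch below is an assumption-laden illustration: it uses a hypothetical three-port chain with two pinned ports, and it enumerates a small integer grid rather than calling a linear programming solver, so it demonstrates the reformulation and not the RLP algorithm itself.

```python
from itertools import product

# Hypothetical chain a - b - c: ports "a" and "c" are pinned by node
# constraints, port "b" is free. Edge weights are data sizes.
edges = {("a", "b"): 3, ("b", "c"): 1}
fixed = {"a": 0, "c": 6}
candidates = range(0, 7)  # small integer search range (assumption)

def original_cost(pi):
    """Objective with absolute values: sum of w_xy * |pi_x - pi_y|."""
    return sum(w * abs(pi[x] - pi[y]) for (x, y), w in edges.items())

def lp_feasible(pi, theta):
    """The two inequality constraints per edge introduced in the text."""
    return all(
        theta[(x, y)] + pi[x] - pi[y] >= 0 and theta[(x, y)] - pi[x] + pi[y] >= 0
        for (x, y) in edges
    )

def lp_cost(theta):
    """Linearized objective: sum of w_xy * theta_xy."""
    return sum(w * theta[e] for e, w in edges.items())

best_original = min(original_cost({**fixed, "b": b}) for b in candidates)

best_lp = min(
    lp_cost(dict(zip(edges, thetas)))
    for b in candidates
    for thetas in product(candidates, repeat=len(edges))
    if lp_feasible({**fixed, "b": b}, dict(zip(edges, thetas)))
)
```

Both minima coincide because, for any fixed offsets, the cheapest feasible value of each auxiliary variable is exactly the absolute difference it bounds.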

4.2 Fixed-size objects and singly-nested loops

Assume for this section that the data weight of edge
(x, y) is constant and equal to 1, and that $I_{xy} = \ell:h:s$.
Call $(\alpha - \alpha') i^T$ the span of edge (x, y) at iteration i. If
the span does not change sign in the interval $[\ell, h]$ (as
shown in Figure 3(a)), the summation and the absolute
value in equation (2) can be interchanged. Then
$C_{xy} = |\sum_{i \in \ell:h:s} (\alpha - \alpha') i^T|$, the closed form for which is

$$C_{xy} = \frac{h - \ell + s}{s} \left| (\alpha_0 - \alpha'_0) + \frac{\ell + h}{2} (\alpha_1 - \alpha'_1) \right|. \qquad (3)$$

Note that the term inside the absolute value is the average
distance spanned by edge (x, y). We can reduce this to
RLP with one new variable per edge.

In general, however, the span may change sign in the
iteration space, and interchanging the summation and the
absolute value is incorrect, as shown in Figure 3(b). In
this case, we partition the iteration space into m equal
subranges $I_1, \ldots, I_m$, each subrange corresponding to a set of
consecutive iterations, and decompose the communication
cost as follows:

$$C_{xy} = \sum_{j=1}^{m} \sum_{i \in I_j} |(\alpha - \alpha') i^T|. \qquad (4)$$

We then pretend that the span does not change sign within
any subrange, which leads to the approximate cost model

$$C_{xy} \approx \tilde{C}_{xy} = \sum_{j=1}^{m} \left| \sum_{i \in I_j} (\alpha - \alpha') i^T \right|. \qquad (5)$$

Now we fix m, expand the outer sum explicitly, and
evaluate each inner sum using equation (3), as shown in
Figure 3(c). Clearly, the span can change sign in at most one
subrange; therefore, at least (m − 1) of the subrange sums
are correct. We then reduce to RLP with m new variables
per edge.

Figure 3: Approximating the cost of communication in
loops. The actual communication cost is equal to the area
under the heavy curve. (a) If the communication function
does not have a zero crossing, then ABDC = ABGE, and
our approximation is exact. (b) If the communication
function has a zero crossing, then ABD + BCE ≠ ACGF.
The maximum relative error in approximation occurs when
B coincides with H, and is proportional to AC. (c) To
reduce the maximum relative error, we partition the iteration
space AC into subranges AP, PQ, and QC. As there are
no zero crossings in subranges PQ and QC, the
approximations there are exact. The approximation in subrange
AP is incorrect, but the maximum relative error is reduced.
In general, at most one of the subranges can have a zero
crossing.

We now bound the error. We can show that the cost
C at the approximate solution exceeds the cost at the best
possible solution by at most a factor of $(1 + 2/m^2)$. (We can
further reduce the error bound by using unequal intervals.)
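
The relationship between the exact cost of equation (4) and the approximate cost of equation (5) can be explored numerically. The following Python sketch is illustrative only; the function names, the way the iterations are split into m nearly equal consecutive chunks, and the sample coefficients are our assumptions. It evaluates the cost model for a given pair of alignments and does not perform the optimization that the error bound refers to.

```python
def span(alpha, alpha_p, i):
    """(alpha - alpha') . (1, i) for a singly nested loop."""
    return (alpha[0] - alpha_p[0]) + (alpha[1] - alpha_p[1]) * i

def exact_cost(alpha, alpha_p, iters):
    """Equation (4) with unit data weight: sum of |span| over all iterations."""
    return sum(abs(span(alpha, alpha_p, i)) for i in iters)

def approx_cost(alpha, alpha_p, iters, m):
    """Equation (5): |sum of span| over each of m consecutive subranges."""
    iters = list(iters)
    n = len(iters)
    bounds = [(j * n) // m for j in range(m + 1)]
    return sum(
        abs(sum(span(alpha, alpha_p, i) for i in iters[lo:hi]))
        for lo, hi in zip(bounds, bounds[1:])
    )

def closed_form(alpha, alpha_p, l, h, s):
    """Equation (3); assumes (h - l) is a multiple of s and no sign change."""
    n = (h - l + s) // s
    d0, d1 = alpha[0] - alpha_p[0], alpha[1] - alpha_p[1]
    # n * |d0 + (l + h)/2 * d1|, kept in integers (n * (l + h) is always even).
    return abs(n * (2 * d0 + (l + h) * d1)) // 2

loop = range(1, 101)
no_crossing = ((5, 2), (0, 1))      # span = 5 + i, always positive
crossing = ((-50, 1), (0, 0))       # span = i - 50, changes sign at i = 50
```

With no zero crossing the single-subrange approximation and the closed form both match the exact cost; with a crossing, refining the partition moves the approximation toward the exact value, and one subrange per iteration (unrolling) recovers it.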

The discussion above suggests several possible algo- 
rithms for solving the mobile offset alignment problem, 
which we now list. 

1. Unrolling: Make every iteration a subrange, and use 
RLP. This is equivalent to unrolling the loop. It is 
exact, but is impractical unless the number of iterations 
is small. 

2. State space search: Approximate the iteration space 
as a single subrange, and use RLP. Using this solution 
as an initial guess, optimize the exact cost equation (4) 
by, for example, steepest descent. 

3. Tracking zero crossings: Split the iteration space 
into two equal subranges, and use RLP. If the span has 
a zero crossing in the range, locate it, and move the 
subrange boundaries to coincide with this point. Now 
solve the new RLP and iterate until convergence. This 
solves a sequence of fixed-size problems, each with 
2|E| new variables. Convergence of this method is
not guaranteed. 

4. Recursive refinement: Approximate the iteration 
space as a single subrange, and use RLP. Now ex- 
amine the solution to determine subranges (at most 
one per edge) in which the span has a zero crossing. 
Break each subrange in two at the zero crossing, and 
formulate and solve a new RLP. Continue the refinement
until some stopping criterion is satisfied (e.g.,
there are no more subranges to be refined, the ob- 
jective function shows no further improvement, we 
run out of time). This requires solving a sequence of 
progressively larger problems. 

5. Fixed partitioning: Partition the iteration space into 
three subranges, and use RLP. The solution is guar- 
anteed to be within 22% of optimal. This requires 
solving a single problem with 3|E| new variables. (A
five-way partition would reduce the error bound to
8%.)

We advocate the fixed partitioning method as a good com- 
promise between speed, reliability, and quality. 

4.3 Variable-size objects in singly-nested loops

Now suppose that $I_{xy} = \ell:h:s$ and that the data weight
of edge (x, y) at iteration i is $\beta_0 + \beta_1 i$, where $\beta_0$ and $\beta_1$
are integer constants. Then the communication cost of the
edge is

$$C_{xy} = \sum_{i \in \ell:h:s} (\beta_0 + \beta_1 i) \, |(\alpha - \alpha') i^T|.$$

Assuming the span does not change sign in $[\ell, h]$, we
can write the communication cost of edge (x, y) as

$$C_{xy} = |(\beta_1 \sigma_1 + \beta_0 \sigma_0)(\alpha_0 - \alpha'_0) + (\beta_1 \sigma_2 + \beta_0 \sigma_1)(\alpha_1 - \alpha'_1)|,$$

where $\sigma_0 = \sum_{i \in \ell:h:s} 1$, $\sigma_1 = \sum_{i \in \ell:h:s} i$, and
$\sigma_2 = \sum_{i \in \ell:h:s} i^2$ can be evaluated in closed form:

$$\sigma_0 = (h - \ell + s)/s,$$

$$\sigma_1 = (s \sigma_0^2 + (2\ell - s) \sigma_0)/2,$$

$$\sigma_2 = (2 s^2 \sigma_0^3 + (6 s \ell - 3 s^2) \sigma_0^2 + (6 \ell^2 - 6 s \ell + s^2) \sigma_0)/6.$$

We then determine the alignment coefficients as in
Section 4.2.
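
The closed forms for the three sums are easy to get wrong, so the following Python sketch checks them against direct summation. It is a verification aid under our own assumptions: the function names are hypothetical, the loop bounds are chosen so that h − ℓ is a multiple of s, and the sample weights and coefficients give a span that stays positive.

```python
def sigmas_direct(l, h, s):
    """sigma_0, sigma_1, sigma_2 by explicit summation over l:h:s."""
    iters = range(l, h + 1, s)
    return (len(iters), sum(iters), sum(i * i for i in iters))

def sigmas_closed(l, h, s):
    """Closed forms from the text; assumes (h - l) is a multiple of s."""
    s0 = (h - l + s) // s
    s1 = (s * s0**2 + (2 * l - s) * s0) // 2
    s2 = (2 * s**2 * s0**3 + (6 * s * l - 3 * s**2) * s0**2
          + (6 * l**2 - 6 * s * l + s**2) * s0) // 6
    return (s0, s1, s2)

def weighted_cost_closed(beta, delta, l, h, s):
    """Closed-form edge cost; delta = alpha - alpha', span assumed sign-constant."""
    s0, s1, s2 = sigmas_closed(l, h, s)
    return abs((beta[1] * s1 + beta[0] * s0) * delta[0]
               + (beta[1] * s2 + beta[0] * s1) * delta[1])

def weighted_cost_direct(beta, delta, l, h, s):
    """Direct evaluation of the weighted sum of |span| over l:h:s."""
    return sum((beta[0] + beta[1] * i) * abs(delta[0] + delta[1] * i)
               for i in range(l, h + 1, s))
```

For instance, with the loop 3:21:3, data weight 2 + i, and span 1 + i, both `weighted_cost_closed((2, 1), (1, 1), 3, 21, 3)` and the direct sum evaluate the same quantity.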

4.4 Loop nests 

The method generalizes to loop nests as follows. Divide
the index range for each LIV into three subranges. The
Cartesian product of this decomposition divides the
iteration space into $3^k$ subranges, over each of which we assume
that there is no sign change in the span; we sum the cost over
each subrange, yielding one term in the approximate cost.
We then solve for the minimizer of the approximate cost as
in Section 4.2. It is also possible to use other quadrature
rules to approximate the cost over each subrange.

For a k-deep loop nest, the problem has $3^k |E|$ variables.
This technique will therefore not scale well for deep loop
nests. We do not expect this to be a problem for Fortran 90,
where array operations and forall loops are used to
express in parallel what would be loop code in a sequential
language.

The Cartesian product formulation handles imperfect 
and trapezoidal loop nests quite naturally. The key to this 
is the transformer nodes that bridge the different levels of 
the loop nest. 


5 Replication 

Until now we have considered alignment as a one-to-one 
mapping from an object to the template. We now relax our 
definition and make it a one-to-many mapping, introducing 
the notion of replication. We define replication as an offset 
alignment that is a set of positions rather than a single 
position. We restrict the possible sets of positions to be 
triplets $\ell:h:s$.

A d-dimensional object aligned to a t-dimensional
template has d body axes (which require axis, stride, and offset
alignments) and (t − d) space axes (which require only
offset alignments). Our notion of replication allows the offset
alignment along a space axis of an object to be a regular
section of the corresponding template axis. We use the
symbol * to indicate replication across an entire template
axis. For example, $A(i) \boxplus [i, 10]$ aligns A with one
position along the second template axis; $A(i) \boxplus [i, 10:20:2]$
aligns A with a subset of the second template axis; and
$A(i) \boxplus [i, *]$ replicates A across all of the second template
axis. A broadcast communication occurs on an edge along
which data flows from a fixed offset to a replicated offset.

5.1 Replication labeling 

Offset alignment begins with a phase called replication
labeling, whose purpose is to decide which ports of the
ADG should have replicated positions. In this section, we 
propose an algorithm for replication labeling. Our algo- 
rithm labels ports as being replicated or non-replicated, but 
does not determine the extent of replication. Instead, we 
plan to generate the extents of replicated alignments in a 
storage optimization phase that follows replication. 

There are three sources of replication: 

• A spread operation causes replication. 

• The use of lookup tables indexed by vector-valued 
subscripts is more efficient if the lookup table is repli- 
cated across the processors; we will replicate them 
with the programmer’s permission. 

• A read-only object with mobile offset alignment in a 
space axis can be realized through replication. 

Subject to these sources, we want to determine which other 
objects should be replicated, in order to minimize broadcast 
communication during program execution. We model the 
problem as a graph labeling problem with two possible 
labels (replicated, non-replicated) and show that it can be 
solved efficiently as a min-cut problem.

Figure 4 shows why replication labeling is useful. In the 
example, a broadcast will occur in every iteration if A is 
not replicated, while a single broadcast will occur (at loop 
entry) if it is replicated. This is the solution found by our 
method. 

After replication labeling, we discard from the ADG 
every edge with a replicated endpoint and proceed to find 
offsets for the non-replicated ports as described in Sec- 
tion 4. The justification for this is that an edge whose 
tail is replicated requires no communication, while an edge 
whose head is replicated requires the same amount of com- 
munication regardless of the offset of the (non-replicated) 
tail. 


real A(100), B(100,200) 

do K = 1,200 
A = cos (A) 

B = B + spread(A, dim=2, ncopies=200) 
enddo 

Figure 4: Replication of the array A. 

5.2 Labeling by network flow

Recall that we determine offsets independently for each 
template axis. We call the axis we are currently labeling the 
current axis . We must label every port of the ADG either 
“replicated” (R) or “non-replicated” (N). The constraints 
on this labeling are as follows: 

1. A port for which the current axis is a body axis has 
label N. 

2. The node for a spread along the current axis has its 
input port labeled R and its output port labeled N.¹ 

3. A port for a read-only object with a mobile alignment 
in the current axis, and for which the current axis is a 
space axis, has label R. 

4. Some other ports have specified labels, such as ports 
at subroutine boundaries, and ports representing repli- 
cated lookup tables. 

5. At every other node, all ports must have the same 
label. 

Subject to these constraints, we want to complete the 
labeling to minimize replication communication. We as- 
sociate with each ADG edge a weight that is the expected 
total communication cost (over time) of having the tail non- 
replicated and the head replicated; the weight is therefore 
the sum over all iterations of the size of the object commu- 
nicated. 

The object is to complete the labeling, satisfying the 
constraints, and minimizing the sum of the weights of the 
edges directed from N to R ports. We now show that this is 
a min-cut problem and can be solved by standard network 
flow techniques. 
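
The objective above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the port names (`A_init`, `A_loop`, `spread_in`) and the two-edge graph are a hypothetical, simplified stand-in for the ADG of Figure 4, and the helper names `edge_weight` and `labeling_cost` are ours, not part of the paper.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]  # (tail port, head port, weight)


def edge_weight(object_size: int, iterations: int) -> int:
    """Edge weight: size of the object communicated, summed over all iterations."""
    return sum(object_size for _ in range(iterations))


def labeling_cost(edges: List[Edge], labels: Dict[str, str]) -> int:
    """Sum the weights of the edges directed from an N port to an R port."""
    return sum(w for tail, head, w in edges
               if labels[tail] == "N" and labels[head] == "R")


# Hypothetical, simplified ADG for Figure 4 (an assumption for illustration).
edges = [
    ("A_init", "A_loop", edge_weight(100, 1)),       # A enters the loop once
    ("A_loop", "spread_in", edge_weight(100, 200)),  # A reaches the spread 200 times
]

# The spread input must be R (constraint 2); A inside the loop is the free choice.
not_replicated = {"A_init": "N", "A_loop": "N", "spread_in": "R"}
replicated = {"A_init": "N", "A_loop": "R", "spread_in": "R"}

cost_not_replicated = labeling_cost(edges, not_replicated)  # 200 * 100 = 20000
cost_replicated = labeling_cost(edges, replicated)          # 1 * 100 = 100
```

Under these assumed weights, replicating A inside the loop moves the N-to-R boundary from the heavy per-iteration edge to the light loop-entry edge, which matches the discussion of Figure 4.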

Theorem 1 An optimal replication labeling can be found 
by network flow. 

¹This sounds strange, but it correctly assigns any necessary 
communication to the input edge rather than to the node. Thus a spread 
node performs neither computation nor communication, but just converts 
a replicated object to a higher-dimensional non-replicated one. 
Proof: We define a weighted, directed graph G, which is 
a slightly modified version of the ADG. The vertices of G 
are as follows. Each node of the ADG except a current-axis 
spread is a vertex of G. If the node has a port labeled N 
or R, the vertex of G has the same label. (No node except 
a current-axis spread can have two ports with different 
labels.) Each current-axis spread corresponds to two vertices 
of G, one for each port, with the input-port vertex labeled R 
and the output-port vertex labeled N. Finally, G has a new 
source vertex s labeled N and a new sink vertex t labeled 
R. The edges of G are as follows. Each directed edge of the 
ADG corresponds to an edge of G with the same weight. 
Also, there is a directed edge of infinite weight from the 
source s to every vertex with label N, and a directed edge 
of infinite weight from every vertex with label R to the sink 
t. 

A cut in G is a partition of its vertices into two sets $X$ 
and $\bar{X}$, with $s \in X$ and $t \in \bar{X}$. The cost of a cut is the total 
weight of the edges that cross it in the forward direction, 
that is, the total weight of directed edges $(x, y)$ with $x \in X$ 
and $y \in \bar{X}$. 

Every replication labeling is a cut, and the cost of the 
labeling is the same as the cost of the cut. Every cut of 
finite cost is a replication labeling (since no infinite-cost 
edge can cross it in the forward direction), and hence a 
minimum-cost cut is an optimum replication labeling. The 
max flow/min cut theorem [14, Theorem 6.2] says that 
the cost of a minimum cut is the same as the value of a 
maximum flow from the source to the sink. □ 
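
The construction in the proof can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a plain breadth-first augmenting-path (Edmonds-Karp style) max-flow routine, and the function names, the vertex names, and the tiny example graph (a hypothetical simplification of Figure 4) are all assumptions. Vertices left unconstrained and unreachable from the source in the final residual graph are labeled R; other minimum cuts may exist.

```python
from collections import deque

INF = float("inf")


def _reachable(cap, adj, s):
    """Breadth-first search over edges with positive residual capacity."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent and cap[(u, v)] > 0:
                parent[v] = u
                queue.append(v)
    return parent


def min_cut_labeling(vertices, edges, fixed):
    """Return (cost, labels) for a minimum-cost N/R labeling.

    vertices: vertex names of G (spreads already split into two vertices)
    edges:    list of (tail, head, weight) taken from the ADG
    fixed:    dict mapping constrained vertices to "N" or "R"
    """
    s, t = "_source", "_sink"
    adj = {v: set() for v in list(vertices) + [s, t]}
    cap = {}

    def add(u, v, c):
        cap[(u, v)] = cap.get((u, v), 0) + c
        cap.setdefault((v, u), 0)  # residual (reverse) edge
        adj[u].add(v)
        adj[v].add(u)

    for u, v, w in edges:
        add(u, v, w)
    for v, label in fixed.items():
        if label == "N":
            add(s, v, INF)  # infinite edge pins v to the source side
        else:
            add(v, t, INF)  # infinite edge pins v to the sink side

    cost = 0
    while True:
        parent = _reachable(cap, adj, s)
        if t not in parent:
            break  # no augmenting path left: the flow is maximum
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[e] for e in path)
        if bottleneck == INF:
            raise ValueError("no finite-cost labeling satisfies the fixed labels")
        for u, v in path:
            cap[(u, v)] -= bottleneck
            cap[(v, u)] += bottleneck
        cost += bottleneck

    # Source side of the cut is N, everything else is R.
    labels = {v: ("N" if v in parent else "R") for v in vertices}
    return cost, labels


# Hypothetical, simplified graph for Figure 4: A enters the loop once (weight 100)
# and feeds the spread input in each of 200 iterations (weight 200 * 100).
cost, labels = min_cut_labeling(
    ["A_init", "A_loop", "spread_in"],
    [("A_init", "A_loop", 100), ("A_loop", "spread_in", 20000)],
    {"A_init": "N", "spread_in": "R"},
)
```

On this assumed graph the minimum cut severs only the loop-entry edge, so `A_loop` is labeled R and the cost is 100, consistent with the single broadcast at loop entry described for Figure 4.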

Both the max flow and the min cut can be found in low-order 
polynomial time by any of several algorithms [14, 15]. 
In particular, the min-cut problem can be solved using linear 
programming. This is ideal for us, since we already require a 
linear programming package for determining mobile offset 
alignments. Linear programming is asymptotically less efficient 
than other methods, but should be adequate for our purposes. 
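
For concreteness, one standard textbook way to pose the min-cut problem as a linear program is shown here; it is not necessarily the formulation the authors have in mind. The symbols are assumptions introduced for illustration: $E$ is the edge set of the ADG, $w_{uv}$ is the weight of edge $(u, v)$, $p_v$ is a potential for each vertex (0 for N, 1 for R), and $d_{uv}$ indicates whether edge $(u, v)$ is directed from an N port to an R port.

$$
\begin{aligned}
\text{minimize} \quad & \sum_{(u,v) \in E} w_{uv}\, d_{uv} \\
\text{subject to} \quad & d_{uv} \ge p_v - p_u && \text{for every } (u,v) \in E, \\
& p_v = 0 && \text{for every vertex constrained to N}, \\
& p_v = 1 && \text{for every vertex constrained to R}, \\
& 0 \le p_v \le 1, \quad d_{uv} \ge 0.
\end{aligned}
$$

This program always has an optimal solution in which every $p_v$ is 0 or 1, so an optimal labeling can be read directly from such a solution: a vertex is N if $p_v = 0$ and R if $p_v = 1$.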


6 Remarks and Conclusions 

We have presented compiler optimizations for determin- 
ing replication and mobile offsets within loops. We have 
proved that an optimal replication labeling can be found by 
network flow. For mobile alignment, we have presented 
an approximate reduction to rounded linear programming, 
with error bounds on the solution quality. 

We now describe several extensions we are currently 
pursuing. 

The framework for determining mobile offset alignment 
can be extended to handle user-defined runtime functions. 
The idea is to incorporate such functions in the offset 
alignment, and treat a mismatch in positions as a shift of 
unknown distance. This allows us to use techniques similar 
to those used in Section 4 to solve the problem. 

While we have concentrated on loop programs, our 
framework can in fact deal with arbitrary control flow. 
Static single-assignment form can be constructed for pro- 
grams with arbitrary control flow graphs. In the presence of 
arbitrary control flow, we can use the control dependence 
graph [6] to associate a control weight $c_e$ of execution with 
every edge $e$ of the ADG, and minimize the expected 
realignment cost 

$$\sum_{(x,y) \in E} \sum_{i \in \mathcal{I}_{xy}} c_{xy}(i) \cdot w_{xy}(i) \cdot d(\pi_x(i), \pi_y(i)).$$

Fanout nodes in an ADG represent the possibility of 
Steiner optimization, in which we determine an optimum 
fanout tree for communicating an object from the position 
in which it is defined to the positions in which it is used [5]. 
The fanout node is an approximation to a Steiner tree, which 
should be constructed in a pass after alignments have been 
determined. 

Our replication algorithm does not determine the extent 
of replication for an object. This could be handled after 
replication labeling by propagating lower bounds on such 
extents. The algorithm also does not deal with storage 
allocation issues for replicated objects. In particular, it does 
not deal with the possibility of storing just one copy per 
physical processor rather than a copy per template cell. We 
feel that this decision fits with other storage optimization 
decisions in a separate phase of the compiler. 

A chicken-and-egg situation exists between replication 
labeling and determining mobile offset alignment, as repli- 
cation can be motivated by a mobile alignment for a read- 
only object. Our current proposal is to iterate the replication 
labeling and mobile alignment phases until quiescence. 

The only reason for restricting replication to space axes 
is that we do not yet completely understand the ramifica- 
tions with regard to storage and communication of allowing 
replication in body axes. Extending the notion of replica- 
tion to body axes would provide a more elegant theory. 

We do not, however, foresee extending the definition of 
alignment to make it a many-to-one mapping (collapsing). 
This complicates the alignment phase, and we feel that it 
is best handled in the distribution phase by mapping some 
template axes to memory. Clearly, there are interactions be- 
tween alignment and distribution, as decisions taken in the 
distribution phase (such as mapping certain template axes 
to memory) can radically alter the assumptions made in the 
alignment phase. We propose handling such interactions 
by iterating the two phases until quiescence. 

We now have a comprehensive theory of alignment anal- 
ysis within a single procedure. Our next major efforts are 
to validate our approach by implementing these techniques, 
to develop a theory of distribution, and to understand the 
interprocedural aspects of alignment and distribution 
analysis. 